The present invention is related to word-breakers. More particularly, the present invention is related to new word extraction or collection methods for use in word-breaking.
Word identification or word-breaking is an important component of natural language processing applications that process textual inputs. In particular, word-breaking is important in most search engines. The search engines perform word-breaking on input strings for several purposes. For example, word-breaking is applied to input strings to determine component words of a compound word.
Word identification or word-breaking is an especially important task for search engines when processing languages, such as Chinese, which have no blank spaces between words. Such languages, which are sometimes loosely referred to as agglutinative languages (although Chinese is more commonly classified as an isolating language), include Chinese, Japanese and Korean, for example. An agglutinative language is a language in which words are made up of a linear sequence of distinct morphemes, and each component of meaning is represented by its own morpheme. Other examples of agglutinative languages include Sumerian, Hurrian, Urartian, Basque and Turkish. Generally, in agglutinative languages, words can be compounded without spaces separating the component words.
In languages such as Chinese, word-breaking is typically implemented by searching for nouns. However, these nouns may be new words which do not exist in the original dictionaries or lexicons used by the word-breaker. When this occurs, the word-breaker cannot properly identify words from web pages and user queries. This in turn causes a lower precision rate in the search results.
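The effect of a missing lexicon entry can be illustrated with a minimal sketch. The sketch assumes a simple forward maximum matching word-breaker, which is only one common lexicon-based technique and is not necessarily the one used by any particular search engine. The function name break_words, the parameter max_len, the sample lexicon and the sample sentence are all hypothetical and are chosen only for illustration.

```python
def break_words(text, lexicon, max_len=4):
    """Forward maximum matching (an assumed, illustrative technique).

    At each position the longest lexicon entry is taken; when no entry
    matches, a single character is emitted as a fallback.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical lexicon that does not yet contain the newer word "微博".
base_lexicon = {"我", "喜欢"}
sentence = "我喜欢微博"

# Without the new word, "微博" is split into two single characters.
without_new_word = break_words(sentence, base_lexicon)

# Once the new word is added to the lexicon, it is kept as one unit.
with_new_word = break_words(sentence, base_lexicon | {"微博"})

print(without_new_word)
print(with_new_word)
```

In this sketch, the unknown word is broken into single characters, so an index or query built from the output would no longer contain the word as a unit. This is one way in which a missing lexicon entry can reduce search precision.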
Collecting new words for a custom lexicon used by the word-breaker is an endless task. Existing techniques for collecting the new words for the custom lexicon are time-consuming and burdensome. Typically, new words are manually collected by search engine developers for addition to the custom lexicon used by that search engine. New words are also manually collected by developers for inclusion in the next product generation's system dictionary. The time-consuming and labor-intensive nature of these new word collection techniques leaves much to be desired.